FastAI Lecture 03

Notes
FastAI
History
Neural Network
Theory
Author

Agastya Patel

Published

January 13, 2024

Modified

January 13, 2024

Rectified Linear Unit: y = max(0, mx + b)
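As a rough illustration, a rectified linear function can be sketched as a line whose negative outputs are replaced by zero. The name `rectified_linear` and the sample values of `m`, `b`, and `x` are assumptions chosen for this example.

```python
def rectified_linear(m, b, x):
    """Evaluate the line m*x + b, then replace a negative result with zero."""
    y = m * x + b
    return max(y, 0.0)

# The line gives -2.0 here, so the result is clipped to 0.0
print(rectified_linear(1.0, 1.0, -3.0))
# The line gives 3.0 here, so the result is left unchanged
print(rectified_linear(1.0, 1.0, 2.0))
```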

Calculating Loss

What are derivatives?

A derivative gives the rate of change of a function at a particular value of its parameter.

In machine learning, the key is knowing how to change the parameters (weights) of a function to reduce the loss. Derivatives tell us how the loss would change if we altered a weight, and calculus lets us compute them to build the gradients of the function. - fastbook
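One way to build intuition for "rate of change at a point" is to estimate it numerically. The sketch below is only an approximation technique (central differences), not how PyTorch computes gradients; `quad` and `numerical_derivative` are hypothetical names, and the quadratic is an arbitrary example function.

```python
def quad(x):
    # Arbitrary example function: 3x^2 + 2x + 1, whose exact derivative is 6x + 2
    return 3 * x ** 2 + 2 * x + 1

def numerical_derivative(func, x, h=1e-6):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

slope = numerical_derivative(quad, 1.5)
exact = 6 * 1.5 + 2
print(slope, exact)  # the estimate should be very close to the exact value 11.0
```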

Calculating derivatives for weights in NN

For a neural network with many weights, we compute the derivative with respect to each weight while treating the others as constants. In deep learning, “gradient” usually means the value of a function’s derivative at a particular argument value. Calling requires_grad_() on a tensor tells PyTorch to track the operations applied to it, so these derivatives can be calculated automatically.

from torch import tensor

def f(x): return x**2

xt = tensor(3.).requires_grad_()

## Calculating function with the value 
yt = f(xt)
yt
>>tensor(9., grad_fn=<PowBackward0>)

## Asking PyTorch to calculate the gradient for us
yt.backward()
# "backward" here refers to backpropagation, which is the name given to the process of calculating the derivative of each layer.

xt.grad
>> tensor(6.)

The derivative of f(x) = x^2 is 2x, so at x = 3 it equals 6. This matches the value PyTorch stored in xt.grad (the gradient).
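The same idea extends to several weights at once. The following is a minimal sketch, assuming a function that sums the squares of a vector so that the output is a single number; the values in `wt` are arbitrary.

```python
import torch

def f(w):
    # Sum of squares, so the result is a scalar we can call backward() on
    return (w ** 2).sum()

# Three "weights" tracked together in one tensor
wt = torch.tensor([3.0, 4.0, 10.0], requires_grad=True)

yt = f(wt)
yt.backward()

# Each entry is the partial derivative 2*w for that weight
print(wt.grad)  # tensor([ 6.,  8., 20.])
```

Each entry of `wt.grad` is the derivative with respect to one weight, with the other weights held constant.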

The gradients only tell us the slope of our function, they don’t actually tell us exactly how far to adjust the parameters. But it gives us some idea of how far; if the slope is very large, then that may suggest that we have more adjustments to do, whereas if the slope is very small, that may suggest that we are close to the optimal value. - fastbook
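Since the slope alone does not say how far to move, a common approach is to scale it by a small number (the learning rate) and step in the opposite direction. Below is a minimal sketch on f(x) = x^2, assuming a hand-written derivative and an arbitrarily chosen learning rate of 0.1.

```python
def f(x):
    return x ** 2

def grad_f(x):
    # Exact derivative of x^2
    return 2 * x

x = 3.0
lr = 0.1  # learning rate, picked arbitrarily for this illustration
history = [x]

for _ in range(5):
    # Move against the slope; a large slope produces a large step
    x = x - lr * grad_f(x)
    history.append(x)

print(history)  # x moves towards 0, where f is smallest
```

The steps shrink as x approaches 0 because the slope itself shrinks there.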

Loss vs Metric

| Aspect | Metric | Loss |
|---|---|---|
| Purpose Difference | Drives human understanding of performance | Drives automated learning by optimization |
| Smoothness Requirement | Not constrained by smoothness | Requires smoothness for meaningful derivative |
| Optimization vs. Real Goal | Reflects actual goals | Compromise between real goals and optimization |
| Calculation Process | Provides overall model evaluation | Calculated per item, averaged at epoch end |
| Focus Consideration | Primary focus for judging performance | Important for automated learning, may not directly represent end goal |
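The smoothness row can be illustrated with a toy example. This sketch assumes a one-weight sigmoid classifier on made-up data, with accuracy as the metric and mean squared error as the loss; all names and values are hypothetical.

```python
import math

# Made-up data: negative inputs belong to class 0, positive inputs to class 1
inputs = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(w, x):
    return sigmoid(w * x)

def accuracy(w):
    # Metric: fraction of items on the correct side of the 0.5 threshold
    correct = [(predict(w, x) > 0.5) == bool(y) for x, y in zip(inputs, labels)]
    return sum(correct) / len(correct)

def mse_loss(w):
    # Loss: mean squared difference between prediction and label
    errors = [(predict(w, x) - y) ** 2 for x, y in zip(inputs, labels)]
    return sum(errors) / len(errors)

for w in (1.0, 1.1):
    print(w, accuracy(w), mse_loss(w))
```

A small change in the weight leaves the accuracy unchanged, so its slope gives no direction to move in, while the loss responds to the same change.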

Why Batches?

After the loss has been calculated, when should the system update the weights? If the loss is calculated for a single item, it carries little information and produces an imprecise, unstable gradient. If the loss is calculated for the entire dataset, each update takes very long.

Mini Batch

So we compute the average loss over a few data items at a time (a mini-batch). The batch size is the number of items in each mini-batch.
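Averaging over a few items at a time can be sketched in plain Python. The per-item loss values below are made up, and `mini_batches` is a hypothetical helper, not a PyTorch or fastai function.

```python
# Made-up per-item losses, purely for illustration
per_item_loss = [0.9, 0.4, 0.7, 0.2, 0.5, 0.1, 0.3]
batch_size = 3

def mini_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# One average loss per mini-batch; the last batch may be smaller
batch_means = [sum(b) / len(b) for b in mini_batches(per_item_loss, batch_size)]
print(batch_means)
```

In training, the weights would be updated once per mini-batch, using the gradient of that batch's average loss.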

| Batch Size | Quality | Time | Mini-batches per epoch |
|---|---|---|---|
| Larger | More accurate and stable estimate of your dataset’s gradients from the loss function | Longer time to process | Fewer mini-batches per epoch |
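The effect of batch size on the number of mini-batches per epoch is simple arithmetic. A small sketch, assuming a dataset of 26 items and a hypothetical helper `batches_per_epoch`:

```python
import math

def batches_per_epoch(n_items, batch_size):
    """Number of mini-batches needed to see every item once."""
    return math.ceil(n_items / batch_size)

for bs in (1, 6, 26):
    # Larger batch sizes mean fewer weight updates per pass over the data
    print(bs, batches_per_epoch(26, bs))
```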

NOTE: The batch size cannot be made arbitrarily large, because each mini-batch has to fit in GPU memory.

Randomization with mini batches

A Dataset is a collection of (input, label) tuples. In both PyTorch and fastai it is passed to a DataLoader, which can then create random mini-batches from it.

import string
from fastcore.foundation import L
from fastai.data.load import DataLoader

ds = L(enumerate(string.ascii_lowercase))
ds
>> (#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j')...]

dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)
>> [(tensor([17, 18, 10, 22,  8, 14]), ('r', 's', 'k', 'w', 'i', 'o')),
 (tensor([20, 15,  9, 13, 21, 12]), ('u', 'p', 'j', 'n', 'v', 'm')),
 (tensor([ 7, 25,  6,  5, 11, 23]), ('h', 'z', 'g', 'f', 'l', 'x')),
 (tensor([ 1,  3,  0, 24, 19, 16]), ('b', 'd', 'a', 'y', 't', 'q')),
 (tensor([2, 4]), ('c', 'e'))]
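To see roughly what shuffling and batching involve, here is a plain-Python sketch. It is not how DataLoader is implemented, and it does not collate items into tensors; `shuffled_batches` and its `seed` parameter are hypothetical.

```python
import random
import string

# Same toy dataset as above: (index, letter) tuples
ds = list(enumerate(string.ascii_lowercase))

def shuffled_batches(dataset, batch_size, seed=None):
    """Shuffle the item order, then yield lists of at most batch_size items."""
    rng = random.Random(seed)
    order = list(range(len(dataset)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [dataset[i] for i in order[start:start + batch_size]]

batches = list(shuffled_batches(ds, batch_size=6, seed=0))
print([len(b) for b in batches])  # [6, 6, 6, 6, 2]
```

Every item appears in exactly one mini-batch, and the final mini-batch holds whatever is left over.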

| Term | Meaning |
|---|---|
| ReLU | Function that returns 0 for negative numbers and doesn’t change positive numbers. |
| Mini-batch | A small group of inputs and labels gathered together in two arrays. A gradient descent step is updated on this batch (rather than a whole epoch). |
| Forward pass | Applying the model to some input and computing the predictions. |
| Loss | A value that represents how well (or badly) our model is doing. |
| Gradient | The derivative of the loss with respect to some parameter of the model. |
| Backward pass | Computing the gradients of the loss with respect to all model parameters. |
| Gradient descent | Taking a step in the directions opposite to the gradients to make the model parameters a little bit better. |
| Learning rate | The size of the step we take when applying SGD to update the parameters of the model. |
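The terms above fit together in a training loop. The following is a minimal sketch under several assumptions: the data is made up from a known line, the model is a single line `m*x + b`, the loss is mean squared error, and the whole dataset is used as one batch for brevity instead of random mini-batches.

```python
import torch

# Made-up data generated from the line y = 3x + 2
x = torch.linspace(-1, 1, 20)
y = 3 * x + 2

# Two parameters [m, b], tracked so PyTorch can compute their gradients
params = torch.zeros(2, requires_grad=True)
lr = 0.1  # learning rate, chosen arbitrarily

def model(x, params):
    m, b = params
    return m * x + b

losses = []
for step in range(50):
    preds = model(x, params)               # forward pass
    loss = ((preds - y) ** 2).mean()       # loss
    loss.backward()                        # backward pass: fills params.grad
    with torch.no_grad():
        params -= lr * params.grad         # gradient descent step
        params.grad.zero_()                # reset gradients for the next step
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The loss recorded at the last step should be much smaller than at the first, since each step nudges `m` and `b` towards the line that generated the data.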